28 July 2017

S0: Prior to lecture

Preparing for this lecture

  • Familiar yourself with the iris dataset. Typing iris into R console should load this data. Pay attention to its column, row names and structure of each column.

  • Please run these codes on your laptop,

## Might be a while...
install.packages(c("ggplot2","dplyr","tidyr","janitor","plotly",
                   "devtools","learnr","gapminder")) 

library(devtools)
install_github("kevinwang09/2017_STAT3914", subdir = "learnr3914")

S1: Necessary evils of Applied Statistics

Good statistical discoveries don't fall out from the sky

  • Statisticians are great at many things:

    1. Understanding data characteristics
    2. Building statistical/mathematical models
    3. Repeat 1 and 2…like…a lot…
    4. Extract insights
  • But the mother of all these, i.e. preparing data is not trivial. (e.g. STAT2xxx lab exams)

Let \(\boldsymbol{X}\) be the thing I want…

  • The real problem is not applying fancy shampoo for your cat. It is getting your cat into the bathtub.

Hidden side of being a statistician

  • Assume we have data
  • Assume we have cleaned data
  • Assume we can reveal aspects of the data
  • Assume we can perform statistics on the data
  • Assume we interrogated the right aspects of the data
  • Assume we did everything right, communicate insights to others

Summary of this lecture

  • Powerful tools for data preparation to allow rapid application of statistics.
  • Passive learning is not going to work.
  • S1: Introduction
  • S2: Reading in data using readr and readxl
  • S3: Basic data cleaning using janitor
  • S4: Clean coding using magrittr
  • S5: Data filtering using dplyr
  • S6: Data visualisation using ggplot2
  • S7: Conclusion

S2: Reading data

Better read/write data

  • base R functions haven't adapted to the needs of modern users.

  • readr functions are superior in data import warnings, column type handling, speed, scalability and consistency.

library(readr)

Reading data using readr

dirtyIris = read_csv("dirtyIris.csv")
class(dirtyIris)
[1] "tbl_df"     "tbl"        "data.frame"
dirtyIris
# A tibble: 250 x 6
  SepAl....LeNgth `Sepal.?    Width` `petal.Length(*&^` `petal.$#^&Width`
            <dbl>              <dbl>              <dbl>             <dbl>
1      5.80000000                2.6                4.0               1.2
2      5.80000000                2.7                5.1               1.9
3      0.01874617                 NA                 NA                NA
4              NA                 NA                 NA                NA
5              NA                 NA                 NA                NA
# ... with 245 more rows, and 2 more variables: `SPECIES^` <chr>,
#   allEmpty <chr>
  • readxl and haven (for SAS, SPSS etc.) packages work similarly.
  • tibble is a data.frame with better formatting.
  • We now proceed to data cleaning on the dirtyIris dataset.

S3: Cleaned data

What is clean data?

Clean data is a data set that allows you to do statistical modelling without extra processing

  1. Good documentation on the entire data.
  2. Each column is a variable. The name should be informative, and:
    • No bad characters/formatting @KevinWang009
    • No inconsistent capitalisation or separators (Cricket_australia vs cricket.Australia)
  3. Each row is an observation:
    • No bad characters
    • No poorly designed row names (3, 2, 5, … )
    • No repeated row names (a, a.1, b, b.1, … )

Data cleaning in R

  • Clean data is a well-designed data.frame.

  • base R functions might struggle with all issues we mentioned.

  • Each column should be imported according to their most appropriate type.
    • e.g. Months should be read in as character or Date, not a factor ordered alphabetically (default in R).
  • Basic data cleaning using janitor package.

  • More advanced data manipulation through dplyr.

  • Our goal: clean the dirtyIris data to be exactly the same as the original iris data.

janitor: basic data cleaning

  • Clean up the bad column names
library(janitor)
dirtyIris
# A tibble: 250 x 6
  SepAl....LeNgth `Sepal.?    Width` `petal.Length(*&^` `petal.$#^&Width`
            <dbl>              <dbl>              <dbl>             <dbl>
1      5.80000000                2.6                4.0               1.2
2      5.80000000                2.7                5.1               1.9
3      0.01874617                 NA                 NA                NA
4              NA                 NA                 NA                NA
5              NA                 NA                 NA                NA
# ... with 245 more rows, and 2 more variables: `SPECIES^` <chr>,
#   allEmpty <chr>
## Clean up column names
better = clean_names(dirtyIris) 
better
# A tibble: 250 x 6
  sepal_length sepal_width petal_length petal_width    species allempty
         <dbl>       <dbl>        <dbl>       <dbl>      <chr>    <chr>
1   5.80000000         2.6          4.0         1.2 versicolor     <NA>
2   5.80000000         2.7          5.1         1.9  virginica     <NA>
3   0.01874617          NA           NA          NA       <NA>     <NA>
4           NA          NA           NA          NA       <NA>     <NA>
5           NA          NA           NA          NA       <NA>     <NA>
# ... with 245 more rows

janitor: removal of missing values

  • Purely empty rows/columns are non-informative.
## Removing empty rows/columns
evenBetter = remove_empty_rows(better)
evenBetter = remove_empty_cols(evenBetter)

evenBetter
# A tibble: 209 x 5
  sepal_length sepal_width petal_length petal_width    species
         <dbl>       <dbl>        <dbl>       <dbl>      <chr>
1   5.80000000         2.6          4.0         1.2 versicolor
2   5.80000000         2.7          5.1         1.9  virginica
3   0.01874617          NA           NA          NA       <NA>
4   6.50000000         3.0          5.5         1.8  virginica
5   6.30000000         2.9          5.6         1.8  virginica
# ... with 204 more rows
  • Genuinely missing values should be retained, but in this case, the NA's were added.

  • Removing any rows with NA

evenBetterBetter = remove_missing(evenBetter) 
cleanIris = evenBetterBetter
  • We now have agreement over the size of the two data
glimpse(cleanIris)
Observations: 150
Variables: 5
$ sepal_length <dbl> 5.8, 5.8, 6.5, 6.3, 5.1, 6.2, 5.9, 6.8, 5.0, 6.7,...
$ sepal_width  <dbl> 2.6, 2.7, 3.0, 2.9, 3.8, 2.8, 3.0, 3.2, 3.0, 3.0,...
$ petal_length <dbl> 4.0, 5.1, 5.5, 5.6, 1.9, 4.8, 4.2, 5.9, 1.6, 5.0,...
$ petal_width  <dbl> 1.2, 1.9, 1.8, 1.8, 0.4, 1.8, 1.5, 2.3, 0.2, 1.7,...
$ species      <chr> "versicolor", "virginica", "virginica", "virginic...
glimpse(iris)
Observations: 150
Variables: 5
$ Sepal.Length <dbl> 5.1, 4.9, 4.7, 4.6, 5.0, 5.4, 4.6, 5.0, 4.4, 4.9,...
$ Sepal.Width  <dbl> 3.5, 3.0, 3.2, 3.1, 3.6, 3.9, 3.4, 3.4, 2.9, 3.1,...
$ Petal.Length <dbl> 1.4, 1.4, 1.3, 1.5, 1.4, 1.7, 1.4, 1.5, 1.4, 1.5,...
$ Petal.Width  <dbl> 0.2, 0.2, 0.2, 0.2, 0.2, 0.4, 0.3, 0.2, 0.2, 0.1,...
$ Species      <fctr> setosa, setosa, setosa, setosa, setosa, setosa, ...

S4: Clean coding (Skim through)

Coding complexity increases with the number of brackets

  • The "inside out" structure of coding isn't great for human reading.
mean(cleanIris$sepal_length)
[1] 5.843333
plot(density(cleanIris$sepal_length), col = "red", lwd = 2)

Piping: read code from left to right

  • We introduce a new notation: " x %>% f " means "f(x)". We call this operation as "x pipe f".

  • Compounded operations are possible. Keyboard shortcut is Cmd+shift+M.

cleanIris$sepal_length %>% mean
[1] 5.843333
cleanIris$sepal_length %>%
  density %>%
  plot(col = "red", lwd = 2)

S5: dplyr: data subsetting master

Traditional way of subsetting data in R (optional)

  • If I want the first two rows of column sepal_length and sepal_width in the cleanIris data:
## Assuming you know the position of column names.
## But what if you resample your data?
cleanIris[1:2, c(1, 2)]

## Assuming you know the position of column names.
## Also assuming the first two columns satisfy certain properties.
cleanIris[1:2, c(T, T, F, F, F)]

## Much better!
## What if you can't type out all the column names
## due to the size of your data?
cleanIris[1:2, c("sepal_length", "sepal_width")]
  • Something more realistic:
cleanIris[(cleanIris[,"sepal_length"] < 5) &
            (cleanIris[,"sepal_width"] < 3), c("petal_length", "sepal_length")]
# A tibble: 4 x 2
  petal_length sepal_length
         <dbl>        <dbl>
1          1.3          4.5
2          1.4          4.4
3          3.3          4.9
4          4.5          4.9
  • (Optional) A pro R user might know about the subset function, but it suffers the same problem of not able to have multiple subsetting criteria without predefined variables.
subset(cleanIris,
       subset = (sepal_length < 5) & (sepal_width < 3),
       select = grep("length", colnames(cleanIris), value = TRUE))
# A tibble: 4 x 2
  sepal_length petal_length
         <dbl>        <dbl>
1          4.5          1.3
2          4.4          1.4
3          4.9          3.3
4          4.9          4.5

Subsetting data using dplyr

  • Think of subsetting rows and columns as two separate different procedures:
  • select columns are operations on variables.
  • filter rows are operations on observations.

  • See dplyr cheatsheet.

library(dplyr)

cleanIris %>%
  filter(sepal_length < 5, sepal_width < 3) %>%
  select(contains("length"))
# A tibble: 4 x 2
  sepal_length petal_length
         <dbl>        <dbl>
1          4.5          1.3
2          4.4          1.4
3          4.9          3.3
4          4.9          4.5

dplyr is much more!

  • mutate creates new columns
iris_mutated = mutate(cleanIris,
      newVar = sepal_length + petal_length,
      newNewVar = newVar*2)

iris_mutated
# A tibble: 150 x 7
  sepal_length sepal_width petal_length petal_width    species newVar
         <dbl>       <dbl>        <dbl>       <dbl>      <chr>  <dbl>
1          5.8         2.6          4.0         1.2 versicolor    9.8
2          5.8         2.7          5.1         1.9  virginica   10.9
3          6.5         3.0          5.5         1.8  virginica   12.0
4          6.3         2.9          5.6         1.8  virginica   11.9
5          5.1         3.8          1.9         0.4     setosa    7.0
# ... with 145 more rows, and 1 more variables: newNewVar <dbl>
  • group_by + summarise will create summary statistics for grouped variables
bySpecies = cleanIris %>%
  group_by(species)

bySpecies
# A tibble: 150 x 5
# Groups:   species [3]
  sepal_length sepal_width petal_length petal_width    species
         <dbl>       <dbl>        <dbl>       <dbl>      <chr>
1          5.8         2.6          4.0         1.2 versicolor
2          5.8         2.7          5.1         1.9  virginica
3          6.5         3.0          5.5         1.8  virginica
4          6.3         2.9          5.6         1.8  virginica
5          5.1         3.8          1.9         0.4     setosa
# ... with 145 more rows
bySpecies %>%
  summarise(meanSepalLength = mean(sepal_length))
# A tibble: 3 x 2
     species meanSepalLength
       <chr>           <dbl>
1     setosa           5.006
2 versicolor           5.936
3  virginica           6.588
  • left_join for merging data
flowers = data.frame(species = c("setosa", "versicolor", "virginica"),
                     comments = c("meh", "kinda_okay", "love_it!"))

## cleanIris has the priority in this join operation
iris_comments = left_join(cleanIris, flowers, by = "species")

## Randomly sampling 6 rows 
sample_n(iris_comments, 6) 
# A tibble: 6 x 6
  sepal_length sepal_width petal_length petal_width    species   comments
         <dbl>       <dbl>        <dbl>       <dbl>      <chr>     <fctr>
1          6.2         2.8          4.8         1.8  virginica   love_it!
2          6.7         3.3          5.7         2.1  virginica   love_it!
3          4.6         3.2          1.4         0.2     setosa        meh
4          5.4         3.0          4.5         1.5 versicolor kinda_okay
5          7.7         3.8          6.7         2.2  virginica   love_it!
6          5.8         2.7          5.1         1.9  virginica   love_it!

Checking if we cleaned the data properly

  • arrange for ordering rows
arrangeCleanIris = cleanIris %>% 
  arrange(sepal_length, sepal_width, petal_length, petal_width)

## The true iris data
arrangeIris = iris %>% 
  clean_names() %>% 
  arrange(sepal_length, sepal_width, petal_length, petal_width)
  • We sorted both the processed dirtyIris data and the arranged iris data.
## The `Species` column is character or factor
all.equal(arrangeCleanIris, arrangeIris) 
[1] "Incompatible type for column `species`: x character, y factor"
arrangeIris = arrangeIris %>% 
  mutate(species = as.character(species)) 

## Great! 
all.equal(arrangeCleanIris, arrangeIris)
[1] TRUE
## identical is stronger version of all.equal
## arrangeCleanIris is a tibble
## but arrangeIris is a data.frame
identical(arrangeCleanIris, arrangeIris) 
[1] FALSE
class(arrangeCleanIris)
[1] "tbl_df"     "tbl"        "data.frame"
class(arrangeIris)
[1] "data.frame"
## Now, they are exactly the same!
identical(arrangeCleanIris, arrangeIris %>% as.tibble)
[1] TRUE

dplyr special select functions (advanced)

  • select only if certain string is present
cleanIris %>% 
  select(contains("length"))
# A tibble: 150 x 2
  sepal_length petal_length
         <dbl>        <dbl>
1          5.8          4.0
2          5.8          5.1
3          6.5          5.5
4          6.3          5.6
5          5.1          1.9
# ... with 145 more rows
  • select only if a column satisfy a certain condition
bySpecies %>%
  summarise_if(is.numeric,
               funs(m = mean))
# A tibble: 3 x 5
     species sepal_length_m sepal_width_m petal_length_m petal_width_m
       <chr>          <dbl>         <dbl>          <dbl>         <dbl>
1     setosa          5.006         3.428          1.462         0.246
2 versicolor          5.936         2.770          4.260         1.326
3  virginica          6.588         2.974          5.552         2.026

S6: ggplot2: the best visualisation package

ggplot2 tutorial sheet

  • If you managed to install all packages successfully, you should be able to run the following to get an interactive tutorial sheet.

  • Otherwise, please download and compile the "ggplot2_basic_tutorial.Rmd" here

library(learnr3914)
learnggplot2()

ggplot2: the philosophy

  • Di Cook - the real reason that you should use ggplot2 is because its design will force you to use a certain grammar when producing a plot. So much so, every plot you produce is actually a statistic.
library(ggplot2)
p1 =
  iris %>%
  ggplot(aes(x = Sepal.Length,
             y = Sepal.Width,
             colour = Species)) +
  geom_point(size = 3)

p1 

S7: Conclusion

Homework: practise on real data

  • Here is a dataset.
  1. Read the gene-set data into R using a package you found online.
  2. What is the R class of this data? Is it a data.frame?
  3. How many genes are in each gene-set?
  4. What is the mostly frequent mentioned 6 genes?

tidyverse: tidy data + tidy coding

  • tidyverse is a collection of 20+ packages built on the philosophy that both your data and coding should be tidy.

  • These functions:
    • They integrate against each other well.
    • They are NOT ad-hoc programming solutions.
    • They were built with a philosophy of being "tidy" and "reproducible" in mind.
    • They will always throw errors at you, if you don't have a thorough understanding of your data.

A gallery (optional)

Advice in the future

  • Computational tools and organisation.

  • Use RStudio + RMarkdown to document your codes.

  • You don't have to adopt everything I taught you today!

  • Find the component that you like to use and adapt that into your work routine. (Hint: start with cheatsheets and build up gradually.)

  • Take time to re-analyse an old dataset.

  • Learn core functions and vignette.

  • Don't forget the theories and interpretations! This is a course about statistics after all, not Cranking-Out-Numbers-Less-Than-0.05-And-Reject-Null-Hypothesis-101.

Session Info and References

sessionInfo()
R version 3.4.1 (2017-06-30)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS Sierra 10.12.3

Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRlapack.dylib

locale:
[1] en_AU.UTF-8/en_AU.UTF-8/en_AU.UTF-8/C/en_AU.UTF-8/en_AU.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] bindrcpp_0.2    janitor_0.3.0   plotly_4.7.0    dplyr_0.7.2    
 [5] purrr_0.2.2.2   readr_1.1.1     tidyr_0.6.3     tibble_1.3.3   
 [9] ggplot2_2.2.1   tidyverse_1.1.1 knitr_1.16     

loaded via a namespace (and not attached):
 [1] reshape2_1.4.2    haven_1.1.0       lattice_0.20-35  
 [4] colorspace_1.3-2  htmltools_0.3.6   viridisLite_0.2.0
 [7] yaml_2.1.14       rlang_0.1.1       foreign_0.8-69   
[10] glue_1.1.1        modelr_0.1.0      readxl_1.0.0     
[13] bindr_0.1         plyr_1.8.4        stringr_1.2.0    
[16] munsell_0.4.3     gtable_0.2.0      cellranger_1.1.0 
[19] rvest_0.3.2       htmlwidgets_0.9   psych_1.7.5      
[22] evaluate_0.10.1   labeling_0.3      forcats_0.2.0    
[25] httpuv_1.3.5      crosstalk_1.0.0   parallel_3.4.1   
[28] broom_0.4.2       Rcpp_0.12.12      xtable_1.8-2     
[31] scales_0.4.1.9002 backports_1.1.0   jsonlite_1.5     
[34] mime_0.5          mnormt_1.5-5      hms_0.3          
[37] digest_0.6.12     stringi_1.1.5     shiny_1.0.3      
[40] grid_3.4.1        rprojroot_1.2     tools_3.4.1      
[43] magrittr_1.5      lazyeval_0.2.0    pkgconfig_2.0.1  
[46] data.table_1.10.4 xml2_1.1.1        lubridate_1.6.0  
[49] assertthat_0.2.0  rmarkdown_1.6     httr_1.2.1       
[52] R6_2.2.2          nlme_3.1-131      compiler_3.4.1